Import the dataset

In this Notebook we are working with the UGR'16 dataset.
We will not use the entire dataset but only a sample of the July 2016 capture, obtained by keeping one line out of every 50.
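Loading such a headerless CSV with pandas can be sketched as follows; the in-memory sample below is an illustrative placeholder standing in for the real file (the actual path and record values are not shown here):

```python
import io

import pandas as pd

# Hypothetical stand-in for the July 2016 sample file: a headerless CSV
# with 13 comma-separated fields per flow. These two rows are made up
# for illustration, not real UGR'16 records.
sample = io.StringIO(
    "2016-07-27 13:43:21,0.000,42.219.153.7,42.219.153.89,53,57627,UDP,.A....,0,0,1,86,background\n"
    "2016-07-27 13:43:21,4.008,42.219.153.89,42.219.153.7,57627,53,TCP,.AP.SF,0,0,10,1024,dos\n"
)

# header=None because the file has no column names.
df = pd.read_csv(sample, header=None)
print(df.shape)  # (2, 13): 13 columns, no header row
```

On the real file, the same `pd.read_csv(path, header=None)` call yields the 10,789,816 x 13 frame described below.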

Dataset structure

We are going to analyse the structure of the dataset and reformat it if necessary.

We can see that the dataset has no column name and that there are 10,789,816 rows and 13 columns.
In section D, Dataset Preprocessing and Availability, on page 8 of the UGR'16 paper, we can find the names of the columns.

UGR'16

So we are going to rename the columns with the following names:
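The renaming can be sketched as follows, using the 13 names taken from the paper (the tiny placeholder frame stands in for the loaded dataset):

```python
import pandas as pd

# Column names as given in section D of the UGR'16 paper.
columns = [
    "dateTime", "duration", "srcIP", "dstIP", "srcPort", "dstPort",
    "protocol", "flag", "forwardingStatus", "tos", "packets", "bytes", "label",
]

# Illustrative headerless frame standing in for the loaded dataset.
df = pd.DataFrame([[0] * 13])
df.columns = columns
print(df.columns.tolist())
```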

The label column indicates whether a flow is anomalous. It takes 8 different values (background, blacklist, anomaly-spam, dos, scan11, scan44, nerisbotnet, anomaly-sshscan).

The blacklist label is a special case: those flows are identified from a list of blacklisted IPs rather than from their behaviour. We will therefore remove them from the dataset.

We will consider the label background as normal and the others (anomaly-spam, dos, scan11, scan44, nerisbotnet, anomaly-sshscan) as anomalies.
Normal behaviours will take the value 0 and anomalous ones the value 1.
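These two steps, dropping blacklist flows and binarising the label, can be sketched on a toy label column (the values below are illustrative):

```python
import pandas as pd

# Toy label column with illustrative values.
df = pd.DataFrame({"label": ["background", "blacklist", "dos", "nerisbotnet", "background"]})

# Drop blacklist flows: they are identified from an IP blacklist, not from behaviour.
df = df[df["label"] != "blacklist"].copy()

# background -> 0 (normal), every remaining attack label -> 1 (anomaly).
df["label"] = (df["label"] != "background").astype(int)
print(df["label"].tolist())  # [0, 1, 1, 0]
```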

Columns deletion

We can see that the forwardingStatus column contains only a single value, so we can remove it.

We will also remove the srcIP and dstIP columns.
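Dropping the three columns is a one-liner in pandas (the small frame below stands in for the real dataset):

```python
import pandas as pd

# Illustrative frame containing only the columns relevant to this step.
df = pd.DataFrame({
    "forwardingStatus": [0, 0],
    "srcIP": ["42.219.153.7", "42.219.153.89"],
    "dstIP": ["42.219.153.89", "42.219.153.7"],
    "duration": [0.0, 4.008],
})

df = df.drop(columns=["forwardingStatus", "srcIP", "dstIP"])
print(list(df.columns))  # ['duration']
```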

Transformation of the dateTime column

We are going to encode the dateTime column so it can be used as a numerical feature.
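The exact encoding is not spelled out here; one common choice, sketched below, is converting each timestamp string to seconds since the Unix epoch:

```python
import pandas as pd

# Illustrative timestamps in the dataset's format.
df = pd.DataFrame({"dateTime": ["2016-07-27 13:43:21", "2016-07-27 13:43:25"]})

# Parse the strings, then convert nanoseconds since the epoch to seconds.
df["dateTime"] = pd.to_datetime(df["dateTime"]).astype("int64") // 10**9
print(df["dateTime"].tolist())
```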

Check if there are null values

We can see that there are no null values in the dataset.

Final dataset

So our final dataset has only 10 columns: dateTime, duration, srcPort, dstPort, protocol, flag, tos, packets, bytes and label.

Data analysis

In this section we will analyse the dataset.

We can see that the dataset is highly imbalanced: 98.9% of the flows correspond to normal traffic and only 1.1% to anomalies. The usefulness of our models will therefore depend largely on their ability to detect this minority class.
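The class proportions can be computed with `value_counts` (the toy labels below reproduce the reported 98.9% / 1.1% split for illustration):

```python
import pandas as pd

# Illustrative labels reproducing the reported 98.9% / 1.1% split.
df = pd.DataFrame({"label": [0] * 989 + [1] * 11})

# normalize=True returns proportions instead of raw counts.
proportions = df["label"].value_counts(normalize=True)
print(proportions[1])  # 0.011
```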


We will now compare the distribution of the data according to their label.

We can observe that more than 80% of the anomalous flows have a duration of 0 seconds.

100% of the fraudulent traffic uses one of the protocols TCP, UDP or ICMP.

We observe that 90% of the fraudulent traffic uses only 5 different flag values.

With only 5 combinations of flags, protocol and duration we can detect 82% of the fraudulent traffic.

We can observe that there is almost no fraudulent traffic between source ports 20000 and 32000.

We can observe that there is almost no fraudulent traffic between destination ports 20000 and 32000. Also the majority of destination ports of fraudulent behaviors are below 20000.

We can see that the timing of the fraudulent traffic is very irregular.

Data visualisation

We are going to see if we can find some information about the dataset's structure by visualising it.

We can see a separation between the fraudulent traffic and the normal traffic.

We can see that the normal traffic uses many more combinations of srcPort and dstPort.

Prepare data

We are going to prepare the data before training different ML models.

In this dataset there are two types of features: numerical and categorical.

Feature engineering

Prepare numerical data

We are going to normalize the numerical data.
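A minimal sketch of this step, assuming min-max scaling to [0, 1] (the exact scaler used is not stated here):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative numerical columns, e.g. duration and bytes.
X = np.array([[0.0, 86.0],
              [4.0, 1024.0],
              [2.0, 512.0]])

# MinMaxScaler rescales each column independently to [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```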

Prepare categorical data

We are going to encode the categorical data.

The srcPort and dstPort variables take many distinct values, so we will simply label-encode them rather than one-hot encode them.

For the protocol and flag variables we will use one-hot encoding.
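Both encodings can be sketched together on a toy frame (the port, protocol and flag values below are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical columns.
df = pd.DataFrame({
    "srcPort": [53, 80, 53],
    "dstPort": [57627, 443, 443],
    "protocol": ["UDP", "TCP", "UDP"],
    "flag": [".A....", ".AP.SF", ".A...."],
})

# High-cardinality ports: plain integer label encoding.
for col in ["srcPort", "dstPort"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Low-cardinality protocol and flag: one-hot encoding.
df = pd.get_dummies(df, columns=["protocol", "flag"])
print(df.columns.tolist())
```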

Machine Learning models

Isolation Forest

We can see that this model fails to classify the fraudulent traffic correctly.
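The Isolation Forest step can be sketched on synthetic stand-in data (in the notebook the model is fitted on the prepared flow features; the data and parameters below are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: "normal" points plus a few far outliers.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
X_test = np.vstack([rng.normal(size=(10, 4)),
                    rng.normal(loc=8.0, size=(10, 4))])

# contamination is the assumed share of anomalies in the data.
model = IsolationForest(contamination=0.05, random_state=0)
model.fit(X_train)
pred = model.predict(X_test)   # +1 = normal, -1 = anomaly
print(pred)
```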

One class SVM

We can see that the one-class SVM classifies fraudulent traffic better than the Isolation Forest. However, its classification of non-fraudulent traffic is worse.
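A comparable sketch for the one-class SVM, again on synthetic stand-in data with illustrative parameters:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Same synthetic setup: fit on "normal" traffic only, test with outliers.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 4))
X_test = np.vstack([rng.normal(size=(10, 4)),
                    rng.normal(loc=8.0, size=(10, 4))])

# nu upper-bounds the fraction of training points treated as outliers.
model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
model.fit(X_train)
pred = model.predict(X_test)   # +1 = normal, -1 = anomaly
print(pred)
```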

XGBoost